XML BASED SCIENTIFIC DATA MANAGEMENT FACILITY

P. MEHROTRA* and M. ZUBAIR†

Abstract. The World Wide Web Consortium has developed the Extensible Markup Language
(XML) to support the building of better information management infrastructures. The scientific
computing community, realizing the benefits of XML, has designed markup languages for scientific data.
In this paper, we propose an XML based scientific data management facility, XDMF. The project is
motivated by the fact that even though a lot of scientific data is being generated, it is not being shared
because of a lack of standards and infrastructure support for discovering and transforming the data. The
proposed data management facility can be used to discover the scientific data itself and the transformation
functions, and to apply the required transformations. We have built a prototype system of the
proposed data management facility that can work on different platforms. We have implemented the
system using Java and the Apache XSLT engine Xalan. To support remote data and transformation
functions, we had to extend the XSLT specification and the Xalan package.

Key Words. XML, scientific data management, digital library 

Subject Classification. Scientific Data Management 

1. Introduction. We are entering the second phase of the World Wide Web revolution where 
the target for information is not a human, but a machine. In the first phase, a digital document was 
represented using HTML, which is rendered for display by browsers for human consumption. It was soon 
realized that HTML representation of a digital document has limitations. In particular, it makes the 
document unsuitable for machine processing, which is essential for building a distributed information 
infrastructure that can be efficiently searched and managed. The World Wide Web Consortium has
developed an Extensible Markup Language (XML) to support the building of better information 
management infrastructures. XML allows a community to describe its own grammar that meets its needs 
more efficiently. For example, it is now possible for a community to separate the structure of the 
document from its presentation. One can define a set of tags to represent the abstract structure of the 
document, which makes it suitable for machine processing. 

The scientific computing community, also realizing the benefits of XML, has designed markup
languages to represent scientific data. There are several initiatives focusing on this issue, such as the
Extensible Scientific Interchange Language (XSIL) [1] and the Extensible Data Format (XDF) [2]. We
hope that the community will eventually agree on one language for scientific data representation. We
believe that this language will have two components: a core component describing the structure of the
scientific data, and a discipline specific component containing metadata that describes the
circumstances of the data collection and other information needed to understand the data.

A workshop on Interfaces to Scientific Data Archives organized by the California Institute of
Technology made a strong case for an XML based scientific data management infrastructure [3]. In this
paper, we propose an XML based scientific data management facility (XDMF). The project is motivated by
the fact that even though a large amount of scientific data is being generated, both experimentally and
programmatically, relatively little is being shared among scientists because of a lack of standards and
infrastructure support for discovering and transforming the data. The proposed XDMF will make the
process of discovering data, along with the relevant transformations required for sharing such data, easier
and more efficient. In particular, the focus is to automate this process and make it location independent,
so that the user, the data, and the transformation code may be in distributed locations. Consider the
situation in which a scientist wants to use some specific kind of data, for example, wind tunnel data, in the 
course of a simulation. The user visits the XDMF, hosted, say, in Virginia, and executes a search using
some specific metadata fields. The search returns a list of registered data sets satisfying the query. The user
selects one of the data sets from this list after examining its detailed description. The selected data is available
from a site located, say, in California (note that the XDMF keeps only the XML document describing the
data and not the data set itself). However, in many cases the data will be in a format not directly useful to
the user and it would have to be transformed into another format before it can be utilized. The user can 
then search the XDMF for a list of applicable transformation functions. The user selects a transformation 
function located, say in Seattle. The XDMF retrieves the data from California, retrieves the 
transformation function from Seattle, applies the transformation and sends the transformed data to the 
user (we are assuming that the data and transformation functions are accessible through HTTP). In a more 
general scenario, the data would be required only at the time that the code is to be executed as a part of a 
larger application. In such situations, the proposed XDMF can be integrated into a larger framework and 
can facilitate the downloading and transformation of the data at runtime. 

We have built a prototype of the XDMF that can work on different platforms. We have
implemented the system using Java and the Apache XSLT engine, Xalan. To support remote data and
transformation functions, we had to extend the Extensible Stylesheet Language Transformations (XSLT)
specification and the Xalan package. For our initial prototype, we have used XSIL for representing the
scientific data. Note that by doing this we are not endorsing any one initiative. Our objective is to
demonstrate the benefits of an XML based data management facility. In fact, we also show that the
current scientific data markup languages will need to be extended to build the proposed facility.

The rest of the paper is organized as follows. Section 2 provides some background on
XSIL, an XML based scientific data interchange language, and XSLT, an XML transformation language.
Section 3 presents an overview and the architecture of the proposed facility, followed by a brief
description of the current prototype. Section 4 concludes and outlines future work.

2. Background. 

2.1. Extensible Scientific Interchange Language (XSIL). The Extensible Scientific
Interchange Language (XSIL) [1] has been developed by the Center for Advanced Computing Research,
Caltech, to represent the basic syntactic structures of scientific data, such as Table, Array, and Stream, in
XML. The Table is similar to a relational table that contains an unordered set of records, each of the same
format; the Array is a collection of numbers or of some other primitive data type; and the Stream element
provides a link to external and encoded data through files and URLs. Two sample XSIL documents, one
representing small size local (in-line) data and the other representing remote data, are shown on the left
and right side of Figure 1 respectively.

Local data:

    <?xml version="1.0"?>
    <XSIL>
      <Array Name="Coordinates" Type="float">
        <Dim>4</Dim>
        <Dim>2</Dim>
        <Stream Encoding="Text" Type="Local" Delimiter=",">
          1, 0, 1, 1, 0, 1, -1, 1
        </Stream>
      </Array>
    </XSIL>

Remote data:

    <?xml version="1.0"?>
    <XSIL>
      <Array Name="Coordinates" Type="float">
        <Dim>4</Dim>
        <Dim>2</Dim>
        <Stream Type="Remote" Delimiter=",">
          data.dat
        </Stream>
      </Array>
    </XSIL>

FIGURE 1. Sample XSIL documents with local data (left) and remote data (right)
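
To make the Array structure concrete, the following is a minimal Java sketch of how a client might read an in-line Array like the local-data sample in Figure 1. It is not part of XSIL or of our prototype: the class name XsilArrayReader, the method readLocalArray, and the row-major interpretation of the two Dim entries are assumptions made for illustration, and the sketch handles only two-dimensional arrays with in-line text data.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical reader for a two-dimensional, in-line XSIL Array (illustration only). */
class XsilArrayReader {

    /** Returns the values of the first Array element, shaped by its two Dim entries (row-major assumed). */
    static double[][] readLocalArray(String xsil) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xsil)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a well-formed XSIL document", e);
        }
        Element array = (Element) doc.getElementsByTagName("Array").item(0);
        NodeList dims = array.getElementsByTagName("Dim");
        int rows = Integer.parseInt(dims.item(0).getTextContent().trim());
        int cols = Integer.parseInt(dims.item(1).getTextContent().trim());

        // The Stream element carries the delimiter as an attribute and the data as text.
        Element stream = (Element) array.getElementsByTagName("Stream").item(0);
        String delimiter = stream.getAttribute("Delimiter");
        String[] tokens = stream.getTextContent().trim().split(Pattern.quote(delimiter));
        if (tokens.length != rows * cols) {
            throw new IllegalArgumentException(
                    "Expected " + (rows * cols) + " values but found " + tokens.length);
        }
        double[][] values = new double[rows][cols];
        for (int i = 0; i < tokens.length; i++) {
            values[i / cols][i % cols] = Double.parseDouble(tokens[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String sample = "<XSIL><Array Name=\"Coordinates\" Type=\"float\">"
                + "<Dim>4</Dim><Dim>2</Dim>"
                + "<Stream Encoding=\"Text\" Type=\"Local\" Delimiter=\",\">"
                + "1, 0, 1, 1, 0, 1, -1, 1</Stream></Array></XSIL>";
        for (double[] row : readLocalArray(sample)) {
            System.out.println(row[0] + " " + row[1]);
        }
    }
}
```

A remote Stream would require one extra step before the parsing shown here, namely fetching the referenced file or URL.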


2.2. Transformations. The ease of transforming an XML document from one form into another
is key to the usefulness of XML. Transformations are typically necessitated when we move XML
documents between two disparate organizations. In such a case, an XML document in one organization
exists in a form different from the one in the other organization. This could be because the two
organizations are using different languages to mark up their data. For this purpose, the World Wide Web
Consortium has introduced Extensible Stylesheet Language Transformations (XSLT). One uses XSLT to
write stylesheets, which essentially represent a set of instructions for transforming one XML document
type to another. An XSLT engine is needed to process these instructions. An example of a freely
available XSLT engine is Apache Xalan (http://www.apache.org). The XSLT specification also
supports transformations like sorting of document elements, summing and averaging numbers, etc.
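
As an illustration of these built-in operations, the following Java sketch applies a small stylesheet that sorts and sums numbers. It uses the standard javax.xml.transform (JAXP) API, which Xalan also implements, rather than anything specific to our prototype. The class name XsltSortAndSum and the readings/value element names are invented for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustration of XSLT's built-in sorting and summation through the standard JAXP API. */
class XsltSortAndSum {

    // The element names "readings" and "value" are hypothetical; they are not XSIL tags.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='text'/>"
          + "<xsl:template match='/readings'>"
          + "<xsl:for-each select='value'>"
          + "<xsl:sort select='.' data-type='number'/>"
          + "<xsl:value-of select='.'/><xsl:text> </xsl:text>"
          + "</xsl:for-each>"
          + "<xsl:text>sum=</xsl:text><xsl:value-of select='sum(value)'/>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies the given stylesheet text to the given XML text and returns the output as a string. */
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<readings><value>3</value><value>1</value><value>2</value></readings>";
        System.out.println(transform(xml, STYLESHEET)); // prints "1 2 3 sum=6"
    }
}
```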

3. XDMF. 

3.1. Overview. In this project we have focused on XDMF, an XML based scientific data 
management facility for discovering and transforming data sets stored in distributed locations. Figure 2 
gives an overview of the functionality of the XDMF. The XDMF interacts with three entities: the data
generator, the transformation function developer, and a distributed computing framework. The data 
generator is responsible for registering the scientific data that is to be shared with other researchers. To do
so, the data generator uploads the XSIL file describing the structure of the data along with other metadata providing
semantic information about the data. For example, the metadata could contain information about the 
conditions and constraints under which the data was generated. The transformation function developer 
registers the transformation function by uploading the XSLT specification along with necessary metadata 
that describes the type of transformation, the function support and the constraints under which the 
transformation is applicable. We base our approach for transforming scientific data on the XSLT
engine. Because the required scientific data transformation can be complex, for example converting
node-centered unstructured grid data in a CFD simulation to an edge-centered format, it is often not
possible to describe such a transformation in the XSLT specification file itself. In such situations we use
the facility provided by the XSLT specification for referencing external transformation functions. Given
that in most cases we will have to use external transformation functions, the question arises: why use the
XSLT specification at all? The reasons for using the XSLT specification are: (1) the input data
specification is in XML and the transformed data is also specified in XML, which makes an XSLT engine
a natural fit, and (2) the development cost is low because the XSLT engine, a standard and freely
available module, provides support for all the other required work, such as downloading the
transformation function and the scientific data from remote sites, applying the transformation function
to the downloaded data, and storing the transformed data and its XML specification.

We now illustrate the information flow by considering an application designer, working with a
framework, who needs scientific data for an application. During the design phase of the
application, the designer visits the XDMF and identifies a data set registered in the XDMF. Along with the
data set, the designer also chooses an appropriate transformation function in the form of an XSLT
specification. During the execution phase, the data management gateway requests the transformed data
from the XDMF. The XDMF, in turn, downloads the data and the transformation functions from remote
locations, applies the transformation, and returns the XML file describing the transformed data. The
gateway software processes the XML file, retrieves the transformed data, and supplies it to the application.
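
The execution-phase step, in which both the data description and the stylesheet are identified only by URL, can be sketched with the standard JAXP API as follows. This is a simplified illustration under stated assumptions and not the prototype code: the class RemoteTransformSketch and its methods are hypothetical, the stylesheet uses only built-in XSLT (no external functions), and the demonstration uses temporary file: URLs in place of the HTTP locations described above.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: apply a stylesheet located at one URL to a document located at another. */
class RemoteTransformSketch {

    /** Both arguments are URLs; the XSLT processor fetches them when the sources are read. */
    static String transformByUrl(String dataUrl, String stylesheetUrl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(stylesheetUrl));
            StringWriter result = new StringWriter();
            transformer.transform(new StreamSource(dataUrl), new StreamResult(result));
            return result.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not transform " + dataUrl, e);
        }
    }

    /** Demo helper: stores text in a temporary file and returns its file: URL. */
    static String writeTemp(String content) {
        try {
            Path file = Files.createTempFile("xdmf-sketch", ".xml");
            file.toFile().deleteOnExit();
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));
            return file.toUri().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String dataUrl = writeTemp("<XSIL><Array Name=\"Coordinates\" Type=\"float\">"
                + "<Dim>4</Dim><Dim>2</Dim></Array></XSIL>");
        String stylesheetUrl = writeTemp(
                "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
              + "<xsl:output method='text'/>"
              + "<xsl:template match='/'><xsl:value-of select='/XSIL/Array/@Name'/></xsl:template>"
              + "</xsl:stylesheet>");
        System.out.println(transformByUrl(dataUrl, stylesheetUrl));
    }
}
```

Because the caller passes only URLs, the same method would apply unchanged whether the two resources reside on the same host or on different ones.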

3.1. Architecture. The architecture of the XDMF, as shown in Figure 3, consists of (1) a digital 
library that holds the scientific data specifications and the transformation function specifications along with
other metadata, (2) a data transformation component based on the Xalan XSLT engine that retrieves the data and
the transformation function from remote sites and applies the transformation, and (3) publication, search, and
transformation request handlers. All interactions with the XDMF are based on HTTP. The data generator
interacts with the XDMF publication handler to publish the scientific data specification in XSIL along
with other relevant metadata about the data. Similarly, the transformation specification developer
interacts with the publication handler to publish the transformation specification and its metadata. The
application designer interacts with the search handler to discover and identify the scientific data and the
transformation function in the digital library. The framework gateway sends a retrieval request for the
transformed data to the transformation handler, which in turn interacts with the data transformation
component.
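
The division of labor between the publication and search handlers can be illustrated with a small in-memory stand-in for the digital library. The sketch below is hypothetical: the class CatalogSketch, its Entry type, and the metadata field names are assumptions for illustration, and the actual facility stores its records in a relational database behind HTTP handlers rather than in a Java list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory stand-in for the digital library behind the publication and search handlers. */
class CatalogSketch {

    /** One registered item: the URL of its XSIL or XSLT specification plus descriptive metadata. */
    static final class Entry {
        final String kind;                  // "data" or "transformation" (labels assumed here)
        final String specUrl;
        final Map<String, String> metadata;

        Entry(String kind, String specUrl, Map<String, String> metadata) {
            this.kind = kind;
            this.specUrl = specUrl;
            this.metadata = new LinkedHashMap<>(metadata);
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** What a publication handler would do after receiving an upload. */
    void publish(String kind, String specUrl, Map<String, String> metadata) {
        entries.add(new Entry(kind, specUrl, metadata));
    }

    /** What a search handler would do: case-insensitive substring match on one metadata field. */
    List<Entry> search(String kind, String field, String text) {
        String needle = text.toLowerCase(Locale.ROOT);
        List<Entry> hits = new ArrayList<>();
        for (Entry entry : entries) {
            String value = entry.metadata.get(field);
            if (entry.kind.equals(kind) && value != null
                    && value.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        CatalogSketch catalog = new CatalogSketch();
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("subject", "wind tunnel");
        catalog.publish("data", "http://example.org/coordinates.xsil", metadata);
        System.out.println(catalog.search("data", "subject", "Wind").size());
    }
}
```

Only the specification URL and the metadata are stored, which mirrors the design choice that the facility keeps descriptions of the data rather than the data sets themselves.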

The Digital Library architecture is based on the Java-based search service that was developed for
the Joint Training, Analysis and Simulation Center (JTASC) [4]. The benefit of this architecture is that it is
platform independent, and it can work with any Web server, as it is based on Java servlets. Moreover, the
changes required to work with different databases are minimal. Our current implementation supports two
relational databases, one commercial (Oracle) and the other freely available (MySQL).
The architecture employs a three-level caching scheme to improve performance [4].

3.2. Extending XSLT Specification and XALAN. One major problem faced when using XSLT
is its limited functionality, especially for performing complex scientific data transformations. The XSLT
specification supports constructs only for simple operations, such as sorting and summation. However,
scientific data transformations are in general much more complex. To address this issue, we have used the
XSLT extension support to define new functions that can include any scientific transformation logic and to
associate them with Java classes. These functions can then be called in an XSLT specification. For this we
have to make the following modifications in the XSLT specification, as shown in Figure 4. First, we have
to declare an extra namespace for the extension along with an extension-element-prefix (lines 4-5, Figure
4). Second, we declare the new function, polar3 here, and associate it with a remote Java class (lines 6-8,
Figure 4). Lastly, we call the extension function, again polar3, in the appropriate transformation rule of
the XSLT specification (line 14, Figure 4).
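
For concreteness, the Java side of such an extension function might look like the sketch below. Only the name polar3 and its two string arguments (the delimiter and the stream contents) are taken from Figure 4. The body is an assumption made for illustration: we suppose here that the function converts Cartesian (x, y) pairs into polar (r, theta) pairs, and the class name PolarExtensionSketch is invented.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

/**
 * Hypothetical Java class backing an extension function such as polar3.
 * A class deployed for use from a stylesheet would normally be declared public.
 */
class PolarExtensionSketch {

    /**
     * Assumed behavior: reads delimited (x, y) pairs and returns the corresponding
     * (r, theta) pairs, with theta in radians, joined by the same delimiter.
     */
    public static String polar3(String delimiter, String stream) {
        String[] tokens = stream.trim().split(Pattern.quote(delimiter));
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException("Expected an even number of values");
        }
        StringJoiner out = new StringJoiner(delimiter);
        for (int i = 0; i < tokens.length; i += 2) {
            double x = Double.parseDouble(tokens[i].trim());
            double y = Double.parseDouble(tokens[i + 1].trim());
            out.add(Double.toString(Math.hypot(x, y)));   // radius
            out.add(Double.toString(Math.atan2(y, x)));   // angle
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(polar3(",", "1, 0, 0, 1"));
    }
}
```

Taking and returning strings keeps the function easy to call from a stylesheet, where the arguments arrive as the results of string(...) expressions.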

The Xalan package does not provide support for associating an external function with a remote
Java class, e.g., one specified via a URL. As described above, access to remote
transformations is central to the design of the facility (see Figure 2). To provide this support, we had to
extend the Xalan-Java processor to handle extension function calls specified via a URL by modifying
ExtensionFunctionHandler.java in the Xalan package org.apache.xalan.xpath.
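
The standard Java mechanism for loading a class from a URL is java.net.URLClassLoader, and the sketch below shows how it can be combined with reflection to call a static, string-valued function by name. This is not the modified ExtensionFunctionHandler; the class RemoteFunctionLoaderSketch and its invokeStatic method are hypothetical, and the sketch assumes that the code base is a directory or JAR URL rather than the URL of a single .class file.

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;

/** Hypothetical sketch: load a class from remote code bases and invoke a static method on it. */
class RemoteFunctionLoaderSketch {

    /**
     * Loads className from the given code bases (directory or JAR URLs) and calls the
     * public static method methodName, whose parameters are assumed to be all Strings.
     * Classes not found in the code bases are resolved by the parent class loader.
     */
    static Object invokeStatic(URL[] codeBases, String className, String methodName, String... args) {
        try (URLClassLoader loader = new URLClassLoader(codeBases)) {
            Class<?> cls = Class.forName(className, true, loader);
            Class<?>[] parameterTypes = new Class<?>[args.length];
            Arrays.fill(parameterTypes, String.class);
            Method method = cls.getMethod(methodName, parameterTypes);
            return method.invoke(null, (Object[]) args);
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException("Could not invoke " + className + "." + methodName, e);
        }
    }

    public static void main(String[] args) {
        // With no remote code base, a JDK class is used so that the sketch runs offline.
        Object result = invokeStatic(new URL[0], "java.lang.Integer", "parseInt", "42");
        System.out.println(result);
    }
}
```

Running code fetched from a remote site carries obvious security implications, so a production system would need to restrict which code bases it trusts.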

3.3. Prototype. We have implemented a standalone prototype that has the core functionality of 
the data transformation and digital library support. The current prototype allows users to select a scientific 
data specification along with the transformation to be applied. Once selected, the XDMF retrieves the
scientific data from a remote location, say www.icase.edu, retrieves the transformation function
from, say, www.cs.odu.edu, and applies the transformation, delivering the transformed data to the user
over the Web. We have also implemented a Web based publication tool, which allows (a) the data
generator to upload the XSIL specification of the data along with other metadata into the digital
repository, and (b) the transformation function developer to upload the transformation specification along
with necessary metadata into the digital repository.

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:axslt="http://xml.apache.org/xslt"
        xmlns:myxslt="http://www.cs.odu.edu/~zubair/demo/RemoteXSLTExtensions"
        extension-element-prefixes="myxslt" version="1.0">
      <axslt:component prefix="myxslt" functions="polar3">
        <axslt:script lang="javaclass" src="http://www.cs.odu.edu/~zubair/demo/RemoteXSLTExtensions.class"/>
      </axslt:component>
      <xsl:template match="Stream">
        <Stream>
          <xsl:variable name="stream" select="."/>
          <xsl:value-of select="myxslt:polar3(string(@Delimiter), string($stream))"/>
        </Stream>
      </xsl:template>
    </xsl:stylesheet>

FIGURE 4. Modified XSLT specification for transforming data using an external function polar3

4. Conclusion and Future Work. In this paper we have proposed an XML based data
management facility. The proposed XDMF provides support for: (a) registration of XML documents
describing scientific data, (b) registration of XML documents describing transformation functions, (c)
association of a scientific data set with the available transformation functions, (d) searching and browsing
of scientific data based on specific metadata fields, (e) transformation of data once the user has identified
the data and the transformation, and (f) remote data and remote transformation functions. The proposed
XDMF is easy and efficient to build as it is based on XML standards, thereby allowing reuse of publicly
available tools such as parsers and transformation engines. In the future, we plan to develop APIs for the
XDMF to allow framework developers to write their own data manager gateway software for integration
with the XDMF. We also plan to work with other researchers on standardizing an XML language for
specifying transformation functions and the discipline specific component of the scientific data.

Acknowledgments. We are thankful to Tan and Jakatdar for implementing some of the modules
for the XDMF.